{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    " 集成学习的特性使其应用到不平衡学习是很自然的选择，接下来主要对基于样本采样技术的集成学习做介绍\n",
    " \n",
    " ### 一.Bagging\n",
    " \n",
    "BalancedBagging的过程可以简单描述如下：   \n",
    "\n",
    "![avatar](./source/asBagging.png)\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "from sklearn.datasets import make_classification\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import confusion_matrix\n",
    "from imblearn.ensemble import BalancedBaggingClassifier "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "X, y = make_classification(n_classes=2, class_sep=2,\n",
    "weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,\n",
    "n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BalancedBaggingClassifier(base_estimator=None, bootstrap=True,\n",
       "                          bootstrap_features=False, max_features=1.0,\n",
       "                          max_samples=1.0, n_estimators=10, n_jobs=1,\n",
       "                          oob_score=False, random_state=42, ratio=None,\n",
       "                          replacement=False, sampling_strategy='auto',\n",
       "                          verbose=0, warm_start=False)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(X, y,\n",
    "                                                    random_state=0)\n",
    "bbc = BalancedBaggingClassifier(random_state=42)\n",
    "bbc.fit(X_train, y_train) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 23,   0],\n",
       "       [  2, 225]], dtype=int64)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y_pred = bbc.predict(X_test)\n",
    "confusion_matrix(y_test, y_pred)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "有时离群样本点往往会坏在某些维度上(取值特别高/特别低)，可以借用RandomForest的思路再对特征做一个随机抽取，生成不同的随机子空间(random subspace,后面简称RS),这时流程如下：   \n",
    "![avatar](./source/asBagging_FS.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BalancedRandomForestClassifier(bootstrap=True, class_weight=None,\n",
       "                               criterion='gini', max_depth=2,\n",
       "                               max_features='auto', max_leaf_nodes=None,\n",
       "                               min_impurity_decrease=0.0, min_samples_leaf=2,\n",
       "                               min_samples_split=2,\n",
       "                               min_weight_fraction_leaf=0.0, n_estimators=100,\n",
       "                               n_jobs=1, oob_score=False, random_state=0,\n",
       "                               replacement=False, sampling_strategy='auto',\n",
       "                               verbose=0, warm_start=False)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from imblearn.ensemble import BalancedRandomForestClassifier\n",
    "brfc = BalancedRandomForestClassifier(max_depth=2, random_state=0)\n",
    "brfc.fit(X_train, y_train) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 23,   0],\n",
       "       [  2, 225]], dtype=int64)"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y_pred = brfc.predict(X_test)\n",
    "confusion_matrix(y_test, y_pred)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BalancedBaggingClassifier(base_estimator=None, bootstrap=True,\n",
       "                          bootstrap_features=True, max_features=1.0,\n",
       "                          max_samples=1.0, n_estimators=10, n_jobs=1,\n",
       "                          oob_score=False, random_state=42, ratio=None,\n",
       "                          replacement=False, sampling_strategy='auto',\n",
       "                          verbose=0, warm_start=False)"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bbc = BalancedBaggingClassifier(random_state=42,bootstrap_features=True)\n",
    "bbc.fit(X_train, y_train) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 23,   0],\n",
       "       [  2, 225]], dtype=int64)"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y_pred = bbc.predict(X_test)\n",
    "confusion_matrix(y_test, y_pred)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**小结**：上面流程其实是框架性质的：  \n",
    "\n",
    "（1）对于采样其实可以任意选择，比如随机欠采样，聚类采样等都可以；   \n",
    "\n",
    "（2）对于RS其实还有改进空间，比如可以选择差异性较大的子空间，比如按相关性度量对特征维度进行聚类，然后从每个类中选择一个有代表性的特征出来构造新的特征子空间；另外lgb中特征选择方式也可以参考"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 二.Boosting\n",
    "boosting比较有代表的就是RUSBoost和SMOTEBoost了，顾名思义，RUSBoost表示random undersampling+adaboost，而SMOTEBoost表示smote+adaboost  \n",
    "\n",
    "**RUSBoost的流程如下：**   \n",
    "![avatar](./source/RUSBoost.png)  \n",
    "\n",
    "\n",
    "**SMOTEBoost的流程如下：**   \n",
    "\n",
    "![avatar](./source/smoteboost.png)  \n",
    "\n",
    "\n",
    "可见一个是对样本中多数类做欠采样，另一个是对少数类做过采样，其本质都是增加少数类的权重，其中样本权重的更新以及学习器的权重与adaboost一致"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 23,   0],\n",
       "       [  2, 225]], dtype=int64)"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from imblearn.ensemble import RUSBoostClassifier\n",
    "rbc = RUSBoostClassifier(random_state=0)\n",
    "rbc.fit(X_train, y_train)\n",
    "y_pred = rbc.predict(X_test)\n",
    "confusion_matrix(y_test, y_pred)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 三.Bagging&Boosting\n",
    "\n",
    "Bagging和Boosting也可以结合起来使用，比较常用的手段是Bagging嵌套Boosting，比较有代表的模型是EasyEnsemble以及BalancedCascade\n",
    "\n",
    "**EasyEnsemble流程如下：**    \n",
    "\n",
    "首先采用RUS随机生成多个平衡的训练子集，然后再在每个训练子集上分别训练一个AdaBoost，最后再对这些Adaboost做Bagging\n",
    "\n",
    "![avatar](./source/easyensemble.svg)  \n",
    "\n",
    "\n",
    "**BalanceCascade流程如下：**   \n",
    "\n",
    "它是对EasyEnsemble的改进，每一个AdaBoost的训练会排除之前已经正确分类的多数类，所以BalanceCascade只能以串行的方式训练   \n",
    "\n",
    "![avatar](./source/BalancedCascade.svg)  \n",
    "\n",
    "注意：上面Ti表示bias"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 23,   0],\n",
       "       [  2, 225]], dtype=int64)"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from imblearn.ensemble import EasyEnsembleClassifier\n",
    "eec = EasyEnsembleClassifier(random_state=42)\n",
    "eec.fit(X_train, y_train)\n",
    "y_pred = eec.predict(X_test)\n",
    "confusion_matrix(y_test, y_pred)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 四.其他集成方法\n",
    "\n",
    "对于代价敏感学习以及决策输出补偿也可以尝试集成学习的方法，比如前面介绍的AdaCost可看做基于代价敏感学习的一种集成学习方法"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
